Machine Learning and Pattern Recognition: Case Study in Crop Disease Mapping

Guest lecture, University of Edinburgh, 12.11.2019

John Quinn

This lecture is a case study of how different MLPR techniques can be combined to give the basis of a system to monitor viral crop disease.

In [1]:
from IPython.display import Image, IFrame
IFrame('https://docs.google.com/presentation/d/1RFGHYcsdxg1jdl5ofxFZeD78FfGkTlzj6VsqEm3ZJOY/preview#slide=id.g6b05a563b6_0_0', width=800, height=450)
Out[1]:

Look at some spatial data

Read in data on positions of cassava plants, and whether they are categorised as having Cassava Brown Streak Disease (CBSD) or healthy. We just take a subset of the data for now so that the examples run quickly.

In [2]:
import GPy
import mplleaflet
import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [32]:
df = pd.read_csv('cbsd_locations.csv')
df = df.loc[::10,:]

plt.plot(df.lon[df.cbsd==0], df.lat[df.cbsd==0], 'yo', markersize=10, 
         alpha=0.5, markeredgecolor='none')
plt.plot(df.lon[df.cbsd==1], df.lat[df.cbsd==1], 'bo', markersize=10, 
         alpha=0.5, markeredgecolor='none')

plt.gcf().set_size_inches(10, 10)

# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
Out[32]:

How could we make predictions for what's happening across the map?

If we wanted to know what we thought the prevalence of CBSD disease was at a longitude/latitude that we didn't sample, how would we calculate it?

  • We have some inputs $f_i$ (disease status, 0 or 1) at locations $\mathbf{x}_i$ (longitude and latitude).
  • We want to predict $f_*$, the disease status at any unmeasured location $\mathbf{x}_*$.

Don't be put off by the fact that we're dealing with locations on a map: we're basically just trying to learn a function with 2-dimensional inputs and a single output.

In this case, we could use functions where the output is a class (diseased or healthy) or a number (prevalence of disease) at the corresponding location. In other words, this could be modelled as either a classification or a regression problem. Almost any supervised learning method could be used to give some sort of prediction here!

Defining prediction locations

We can construct a grid of coordinates $f_{*} \in \mathbb{R}^2$ covering the spatial extent we're interested in. To get a disease map, we make predictions at all of these points.

In [4]:
GRID_BOUNDS = (29.571499, 35.000273, -1.47887, 4.234466)

grid_points_per_degree = 10.
grid_height = int((GRID_BOUNDS[1] - GRID_BOUNDS[0]) * grid_points_per_degree)
grid_width = int((GRID_BOUNDS[3] - GRID_BOUNDS[2]) * grid_points_per_degree)
x1 = np.linspace(GRID_BOUNDS[0], GRID_BOUNDS[1], grid_width)
x2 = np.linspace(GRID_BOUNDS[2], GRID_BOUNDS[3], grid_height)
xv, yv = meshgrid(x1, x2)
X_star = np.array([xv.ravel(), yv.ravel()]).transpose()

plt.plot(X_star[:,0], X_star[:,1], 'k+')

# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
Out[4]:

Try some supervised learning methods

Any classification or regression method could be used here, though how useful they are to this problem depends on what assumptions the method is based on.

  • Do you think deep learning methods would work in this case? Why or why not?
In [31]:
import sklearn.neighbors
import sklearn.linear_model

model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
#model = sklearn.linear_model.LogisticRegression()

X = df.loc[:, ['lon','lat']]
f = df.cbsd

model.fit(X, f)

f_star = model.predict_proba(X_star)

plt.scatter(X_star[:,0], X_star[:,1], c=f_star[:,1], cmap=plt.cm.plasma_r,
            s=20, alpha=.8, edgecolors='none', marker='s')
plt.gcf().set_size_inches(10, 10)

# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
Out[31]:

Logistic regression doesn't seem very useful here. Nearest neighbours looks a bit more plausible.

  • Are there any drawbacks with these methods?
  • Suggestions for other methods to try?

Modelling uncertainty

Disease maps are used to make decisions with real consequences, such as how to allocate scarce resources or whether to destroy crops in a disease frontier area. We should at least know how confident we are in our predictions.

We can model crop disease observation data by assuming the level of disease incidence $f_i$ at a location $\mathbf{x}_i$ is strongly related to the level of disease incidence $f_j$ at location $\mathbf{x}_j$ if these locations are close together.

First we have to say what 'close together' means. The most obvious thing to look at is the squared Euclidean distance:

\begin{equation} \| \mathbf{x}_i - \mathbf{x}_j \|^2 \end{equation}

However the Euclidean distance is in an arbitrary scale. Using our knowledge of the problem, we can define a 'length scale' $\ell$ so that we have some calibration:

\begin{equation} \frac{ \| \mathbf{x}_i - \mathbf{x}_j \|^2 }{\ell^2 } \end{equation}

Now that we've said what 'close together' means, let's look at what 'strongly related' means. We can convert our distance function into a new expression which maps everything into the range 0 to 1:

\begin{equation} k(\mathbf{x}_i, \mathbf{x}_j) = \exp \left( - \frac{ \| \mathbf{x}_i - \mathbf{x}_j \|^2 }{\ell^2 } \right) \end{equation}

If the two points are at the same location, then we get $k(\mathbf{x}_i, \mathbf{x}_j) = 1$, and as they get further apart then it gets closer (but never quite reaches) 0.

We can call $k(\mathbf{x}_i, \mathbf{x}_j)$ our covariance function, and set $\ell$ with knowledge about the infection mechanism of the disease. This gives us a Gaussian process. The further away the locations are, the less they affect each other, and so we have a basis for calculating how confident our predictions are at each point.

Visualise Gaussian process predictions

In [28]:
# The units are decimal degrees, where 1 degree equals approx 111km at the equator.
kernel = GPy.kern.RBF(2, lengthscale=1)

# Create a simple GP model, which aims to predict CBSD status given location.
model = GPy.models.GPRegression(X, np.array([f]).transpose(),kernel)

# OPTIONAL: optimise the parameters
#model.optimize(messages=True,max_f_eval = 1000)

# Make predictions at each point on the grid
f_star, S = model.predict(X_star)

# It's possible to predict negative values, so only keep the positive part.
f_star[f_star < 0] = 0

marker_scale_factor = .2
plt.scatter(X_star[:, 0], X_star[:, 1], c=f_star[:, 0], s=(marker_scale_factor / (S - 1)),
            cmap=plt.cm.plasma_r, alpha=.8, edgecolors='none', marker='s')
plt.gcf().set_size_inches(10, 10)

# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
Out[28]:
  • Plotting this way, we can see the covariance of the predictions as well as the mean.
  • Different length scales $\ell$ give us different maps.
  • Labels don't have to be 0 or 1; they could represent the proportion of infected plants at a location.
  • Could the covariance function be modified to take other aspects of the problem into account?